# Dr. M. Baron, Statistical Machine Learning class, STAT-427/627
# DIMENSION REDUCTION AND SHRINKAGE
Part I. Variable Selection and Ridge Regression
1. VARIABLE SELECTION
> library(ISLR)      # the Auto data set comes from the ISLR package
> attach(Auto)
> library(leaps)
> reg.fit = regsubsets( mpg ~ cylinders + displacement + horsepower + weight + acceleration + year, Auto )
> summary(reg.fit)
Selection Algorithm: exhaustive
         cylinders displacement horsepower weight acceleration year
1  ( 1 ) " "       " "          " "        "*"    " "          " "
2  ( 1 ) " "       " "          " "        "*"    " "          "*"
3  ( 1 ) " "       " "          " "        "*"    "*"          "*"
4  ( 1 ) " "       "*"          " "        "*"    "*"          "*"
5  ( 1 ) "*"       "*"          " "        "*"    "*"          "*"
6  ( 1 ) "*"       "*"          "*"        "*"    "*"          "*"
# This command finds the best model for each p = number of independent variables.
# The best model of each size is determined by the lowest RSS.
# Next, choose the best p according to some criteria:
> summary(reg.fit)$adjr2 # Adjusted R2
[1] 0.6918423 0.8071941 0.8071393 0.8067872 0.8067841 0.8062826
> summary(reg.fit)$cp # Mallows Cp
[1] 232.396144 1.169751 2.284200 3.992019 5.000800 7.000000
> summary(reg.fit)$bic # BIC = Bayesian information criterion
[1] -450.5016 -629.3564 -624.2828 -618.6081 -613.6448 -607.6743
# Recall that plain R2 is not a fair measure of performance. It always increases with p:
> summary(reg.fit)$rsq
[1] 0.6926304 0.8081803 0.8086190 0.8087638 0.8092549 0.8092553
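# A quick way to pick p from these vectors (a small sketch, not part of the original
# handout): adjusted R2 is maximized, while Cp and BIC are minimized. Here all three
# criteria point to the 2-variable model (weight and year).
> which.max( summary(reg.fit)$adjr2 )      # best p by adjusted R2
> which.min( summary(reg.fit)$cp )         # best p by Mallows Cp
> which.min( summary(reg.fit)$bic )        # best p by BIC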
# For forward stepwise or backward elimination variable selection, use method="forward" or method="backward".
> library(MASS)       # the Boston data set is in the MASS package
> reg = regsubsets( medv ~ ., data=Boston, method="backward" )
# There is a nice way to visualize the results, ranking models by the chosen "scale".
# Black means the variable is included in the model, white means it is excluded.
> plot(reg)
> plot(reg, scale = "adjr2" )
# To see more models, use option "nbest", which is the number of models of each size p to be compared.
> reg = regsubsets( medv ~ ., data=Boston, method="backward", nbest=4 )
> plot(reg, scale = "adjr2" )
# We can also choose the best model by means of a stepwise procedure, starting with one model and ending with another.
> null = lm( medv ~ 1, data=Boston )
> full = lm( medv ~ ., data=Boston )
> step( null, scope=list(lower=null, upper=full), direction="forward" )
Start:  AIC=2246.51
medv ~ 1

          Df Sum of Sq   RSS    AIC
+ lstat    1   23243.9 19472 1851.0     # Compare contributions of
+ rm       1   20654.4 22062 1914.2     # remaining independent variables
+ ptratio  1   11014.3 31702 2097.6
+ indus    1    9995.2 32721 2113.6
+ tax      1    9377.3 33339 2123.1
+ nox      1    7800.1 34916 2146.5
+ crim     1    6440.8 36276 2165.8
+ rad      1    6221.1 36495 2168.9
+ age      1    6069.8 36647 2171.0
+ zn       1    5549.7 37167 2178.1
+ black    1    4749.9 37966 2188.9
+ dis      1    2668.2 40048 2215.9
+ chas     1    1312.1 41404 2232.7
<none>                 42716 2246.5
Step:  AIC=1851.01
medv ~ lstat

          Df Sum of Sq   RSS    AIC
+ rm       1    4033.1 15439 1735.6
+ ptratio  1    2670.1 16802 1778.4
+ chas     1     786.3 18686 1832.2
+ dis      1     772.4 18700 1832.5
+ age      1     304.3 19168 1845.0
+ tax      1     274.4 19198 1845.8
+ black    1     198.3 19274 1847.8
+ zn       1     160.3 19312 1848.8
+ crim     1     146.9 19325 1849.2
+ indus    1      98.7 19374 1850.4
<none>                 19472 1851.0
+ rad      1      25.1 19447 1852.4
+ nox      1       4.8 19468 1852.9
... < truncated > ...
Step:  AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + crim + rad + tax

         Df Sum of Sq   RSS    AIC
<none>                 11081 1585.8
+ indus   1   2.51754 11079 1587.7
+ age     1   0.06271 11081 1587.8
Call:
lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
black + zn + crim + rad + tax, data = Boston)
Coefficients:
(Intercept)        lstat           rm      ptratio          dis          nox
  36.341145    -0.522553     3.801579    -0.946525    -1.492711   -17.376023
       chas        black           zn         crim          rad          tax
   2.718716     0.009291     0.045845    -0.108413     0.299608    -0.011778
# The final model contains variables lstat, rm, ptratio, dis, nox, chas, black, zn, crim, rad, and tax.
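# For comparison (a sketch, not in the original handout), step() can also run backward
# elimination from the full model, or a bidirectional search; AIC is the default
# criterion in both cases.
> step( full, scope=list(lower=null, upper=full), direction="backward" )
> step( null, scope=list(lower=null, upper=full), direction="both" )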
2. RIDGE REGRESSION
# Dataset "Boston" has some strong correlations, resulting in multicollinearity.
> cor(Boston)
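# A quick way to locate the strongest pairwise correlation (a small sketch, not part of
# the original handout):
> cc = cor(Boston)
> diag(cc) = 0                                      # ignore the 1's on the diagonal
> which( abs(cc) == max(abs(cc)), arr.ind=TRUE )    # the most strongly correlated pair
> max( abs(cc) )                                    # magnitude of that correlation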
# Apply ridge regression
> lm.ridge( medv ~ ., data=Boston, lambda=0.5 )
                        crim            zn         indus          chas
 36.270091954  -0.107524388   0.046092635   0.018815242   2.693518778
          nox            rm           age           dis           rad
-17.646013524   3.816019080   0.000578687  -1.469191794   0.301928336
          tax       ptratio         black         lstat
 -0.012134702  -0.950831885   0.009309388  -0.523845421
# We can see how the slopes change with the penalty lambda. These are the estimated slopes
# for different lambda. When lambda=0, we get the least squares estimates (LSE). Large
# lambda forces the slopes toward 0.
> rr = lm.ridge( medv ~ ., data=train, lambda=seq(0,1000,1) )
> rr
> plot(rr)
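# A quick check of the lambda=0 claim above (not in the original handout): ridge
# regression with lambda=0 reproduces the ordinary least squares estimates.
> round( coef( lm.ridge(medv ~ ., data=Boston, lambda=0) ), 4 )
> round( coef( lm(medv ~ ., data=Boston) ), 4 )     # the two sets of estimates agree (up to rounding)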
# To choose a good lambda, fit ridge regression with various lambda and compare prediction performance.
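# One direct way to do this is a validation-set comparison (a sketch, not part of the
# handout; the 50/50 split and the lambda grid are arbitrary illustration choices).
# lm.ridge has no predict() method, so predictions are computed from the coefficient
# matrix. The handout's approach below uses select(), which relies on generalized
# cross-validation (GCV) instead.
> set.seed(1)
> n = nrow(Boston)
> tr = sample(n, n/2)                                # random training half
> rr.val = lm.ridge( medv ~ ., data=Boston[tr,], lambda=seq(0,100,1) )
> B = coef(rr.val)                                   # one row of coefficients per lambda
> X = cbind( 1, as.matrix(Boston[-tr, names(Boston) != "medv"]) )   # test design matrix
> mse = apply( B, 1, function(b) mean( (Boston$medv[-tr] - X %*% b)^2 ) )
> rr.val$lambda[ which.min(mse) ]                    # lambda with the smallest test MSE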
> select(rr)
modified HKB estimator is 4.321483
modified L-W estimator is 3.682107
smallest value of GCV  at 5
# So, the best lambda is around 5. We'll look closer at that range.
> rr = lm.ridge( medv~., data=Boston, lambda=seq(0,10,0.01) )
> select(rr)
modified HKB estimator is 4.594163
modified L-W estimator is 3.961575
smallest value of GCV  at 4.26
> plot(rr$lambda,rr$GCV)
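# An alternative not covered in this handout: the glmnet package fits ridge regression
# (alpha=0) and chooses lambda by 10-fold cross-validation. Note that glmnet standardizes
# the predictors and defines the penalty on a different scale, so its lambda values are
# not directly comparable to those of lm.ridge.
> library(glmnet)
> x = model.matrix( medv ~ ., Boston )[ , -1 ]       # predictor matrix, intercept column dropped
> y = Boston$medv
> cv = cv.glmnet( x, y, alpha=0 )                    # alpha=0 gives the ridge penalty
> cv$lambda.min                                      # lambda with the smallest cross-validated error
> coef( cv, s="lambda.min" )                         # ridge coefficients at that lambda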